The dataset (https://www.kaggle.com/shivam2503/diamonds) studied for this lab presents ~54K polished diamonds and their attributes. Among these are what are commonly referred to as the 4Cs: color, clarity, cut, and carat. In addition, the dataset presents certain measurements of each diamond. Finally, it presents the diamond's price.
For this lab, we attempt to predict the price (which ranges from \$328 to \$18,823) of a diamond given its attributes. This would normally be a regression problem; however, we transform it into a classification problem by stratifying the data. Specifically, we divide the price into “bins.” This has a real-world application for general consumers. A consumer may be interested in ensuring that he is paying a fair price for a diamond given its attributes. Likewise, a consumer may be interested in obtaining the “best” diamond for a given price point, perhaps sacrificing color, cut, or clarity in favor of a larger size. An investor or diamond merchant may also find this application useful, but is likely to already have a sophisticated method for determining price. An appropriate measure of success for the general consumer would be the ability to predict a diamond’s value within 5-10% of its true market price.
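The regression-to-classification transform described above can be sketched as follows (a minimal example on made-up prices spanning the dataset's range; the notebook applies the same `pd.cut` approach to the real data later):

```python
import pandas as pd

# Toy prices spanning the dataset's range ($328 to $18,823)
prices = pd.Series([328, 1500, 5000, 9000, 12500, 18823])

# Equal-width binning turns the continuous target into 10 ordered classes
price_bin = pd.cut(prices, bins=10, labels=list(range(10))).astype(int)
print(price_bin.tolist())  # → [0, 0, 2, 4, 6, 9]
```

With equal-width bins, the cheapest prices collapse into the lowest few classes, which mirrors the class imbalance discussed in the price analysis below.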
This dataset contains 43K carats of the estimated 1.2B carats worldwide (https://www.statista.com/topics/1704/diamond-industry/), or approximately 0.004%, and was last updated in 2017. Unfortunately, little information is provided about the data itself beyond an explanation of its features. Relevant information such as the date of collection (relevant due to fluctuating commodity markets) and the shape (round, oval, heart, etc.) is not included. We attempted to contact the author to gain a better understanding of the data; however, he was unavailable. Despite these shortcomings, we felt this was a good dataset for the lab, as it conforms to the requirements, presents several data types for analysis, seems challenging, and is interesting.
Diamonds are unique; however, a great deal of effort has gone into classifying them throughout history. Generally speaking, the rarer the diamond, the higher the price, and rarity is generally determined through the 4Cs, which are described in more detail below. For reference, the graphic below shows the relevant aspects of a diamond.
Much of the following information is readily available from the Gemological Institute of America (GIA); refer to https://4cs.gia.edu/en-us/diamond-cut/ and https://www.gia.edu/diamond-quality-factor.
As shown below, diamonds come in many different shapes. Shape is generally a personal preference; however, round diamonds are commonly thought to command the highest prices, primarily because of the inherent brilliance of that shape. While the dataset does not include the shape of the gems, we may infer that they are in fact round or square by comparing the provided length and width: the absolute difference between the two ranges from 0 to 50.81mm, with an average of 0.05mm and a standard deviation of 0.25mm.
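That length/width comparison can be sketched as follows (a toy example with made-up measurements standing in for the dataset's `x` (length) and `y` (width) columns):

```python
import pandas as pd

# Made-up length (x) and width (y) measurements in mm
gems = pd.DataFrame({'x': [3.95, 4.05, 6.00], 'y': [3.98, 4.05, 5.90]})

# A near-zero |x - y| difference is consistent with a round or square outline
diff = (gems['x'] - gems['y']).abs()
print(diff.mean(), diff.max())
```

On the real data, the same two statistics (mean 0.05mm, standard deviation 0.25mm) motivate the round-or-square inference above.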
Diamonds may appear anywhere from yellow/brown to colorless, the latter being the rarest. Diamond graders typically grade according to the GIA scale presented below. The dataset used for this lab includes only diamonds of near-colorless and colorless quality, which are likely to command higher prices. As may be seen, the dataset contains a fairly even distribution of colors.
A diamond's clarity concerns imperfections, referred to as inclusions, typically internal to the diamond's structure. Naturally, diamonds with few or no flaws are rare and command higher prices. Again, most diamonds are graded on the GIA scale shown below. The dataset for this lab presents a subset of clarities, ranging from internally flawless (IF) to I$_{1}$. Furthermore, the distribution of data across clarities does not appear very uniform.
Cut deals primarily with the way the diamond reflects light and should not be confused with its shape, which general consumers often casually call “cut.” It should be noted that several features in the dataset relate to cut, such as depth and table size. Cut is graded on the following GIA scale. The dataset excludes the poor-cut category and replaces the excellent-cut category with premium and ideal. In addition, the distribution of data across the cuts is skewed.
Carat is a measure of the diamond’s weight; one carat is equivalent to 0.2g (0.007oz). Larger diamonds tend to be rarer and command higher prices. However, diamond retailers often refer to “magic” carat weights such as 1 carat: a 1-carat diamond may command a significant price premium over a diamond of slightly lower weight, all other things being equal, simply because it surpasses the “magic number” of 1 carat. This dataset contains diamonds ranging between 0.2 and 5.1 carats. As may be seen below, the vast majority of the diamonds are under two carats.
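The carat-to-weight conversion mentioned above is straightforward (a small helper added here for illustration, not part of the original analysis):

```python
# 1 carat = 0.2 g by definition
def carats_to_grams(carats):
    return carats * 0.2

# The dataset's 0.2 to 5.1 carat range corresponds to roughly 0.04 to 1.02 g
print(carats_to_grams(0.2), carats_to_grams(5.1))
```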
import numpy as np
import pandas as pd
df = pd.read_csv('diamonds.csv')
df
# Drop the redundant CSV index column
df_cut = df.drop(["Unnamed: 0"], axis=1)
df_cut.head()
# The following was taken from "https://www.kaggle.com/shrutimechlearn/types-of-regression-and-stats-in-depth"
print('The shape of the Dataset is: '+str(df_cut.shape))
print('Number of rows with x == 0: {} '.format((df_cut.x==0).sum()))
print('Number of rows with y == 0: {} '.format((df_cut.y==0).sum()))
print('Number of rows with z == 0: {} '.format((df_cut.z==0).sum()))
print('Number of rows with depth == 0: {} '.format((df_cut.depth==0).sum()))
# Zero dimensions are physically impossible; flag them as missing
df_cut[['x','y','z']] = df_cut[['x','y','z']].replace(0, np.nan)
df_cut.isnull().sum()
df_cut.dropna(inplace=True)
df_cut.shape
Our data does not have any missing values, but it does contain values that are physically impossible. We believe the zero values are mistakes, because a diamond cannot have a dimension of zero. These quality issues exist because humans are prone to error: the mistake could stem from a mismeasurement or from entering the measurement into the database incorrectly. Since we cannot verify where the mistake occurred, and we know the diamonds must have nonzero dimensions, we chose to eliminate these rows.
df_cut.info()
Cut, color, and clarity are ordinal categorical types, stored as objects. Carat, depth, table, x, y, and z are continuous floats, and price is an integer. Note that price is later stratified into ranges for analysis and visualization purposes, and the ordinal categorical features are converted to integers.
print(df_cut['cut'].unique())
print(df_cut['color'].unique())
print(df_cut['clarity'].unique())
# Map the ordinal categories explicitly (higher value = better grade).
# Relying on the order returned by unique() is fragile and mis-orders
# some grades, so we use explicit dictionaries instead.
df_cut['cut_ord'] = df_cut['cut'].map(
    {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4})
df_cut['color_ord'] = df_cut['color'].map(
    {'J': 0, 'I': 1, 'H': 2, 'G': 3, 'F': 4, 'E': 5, 'D': 6})
df_cut['clarity_ord'] = df_cut['clarity'].map(
    {'I1': 0, 'SI2': 1, 'SI1': 2, 'VS2': 3, 'VS1': 4,
     'VVS2': 5, 'VVS1': 6, 'IF': 7})
df_cut
df_cut.info()
df_cut['volume'] = df_cut['x']*df_cut['y']*df_cut['z']
The diamond's volume is approximated by the product of its lengths along the x, y, and z axes. Since the diamond's shape, which is recognized to be crucial for price, is not present in this dataset, we assume the diamonds are of similar shape.
# Stratify price into 10 equal-width bins: integer labels for modeling,
# interval labels for readability
a = np.linspace(0, 9, 10, dtype=int)
df_cut['price_bin'] = pd.cut(x=df_cut['price'], bins=10, labels=a)
df_cut['price_bin'] = df_cut['price_bin'].astype('int64')
df_cut['price_ranges'] = pd.cut(x=df_cut['price'], bins=10)
df_cut.info()
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import seaborn as sns
plt.figure(figsize=[9,9])
df_cut.price.plot(kind='hist', alpha=0.5, label="cut")
df.price.plot(kind='hist', alpha=0.5, label="original")
plt.legend()
plt.show()
We observe that the price distribution of the diamonds remains essentially unchanged after removing the rows with non-physical values. Thus, we decided to eliminate these rows rather than impute them; they represent an insignificant portion of the data (0.037% of the dataset).
plt.figure(figsize=[21,9])
plt.suptitle('Distribution of Continuous Diamond Features')
plt.subplot(231)
sns.distplot(df_cut['carat'],color='b')
plt.xlabel('Carat Weight')
plt.ylabel('Frequency')
plt.title('Distribution of Diamond Carat Weight')
plt.subplot(232)
sns.distplot(df_cut['price'],color='g')
plt.xlabel('Price in USD')
plt.ylabel('Frequency')
plt.title('Distribution of Diamond Price')
plt.subplot(233)
sns.distplot(df_cut['depth'],color='r')
plt.xlabel('Depth')
plt.ylabel('Frequency')
plt.title('Depth')
plt.subplot(234)
sns.distplot(df_cut['table'],color='r')
plt.xlabel('Table')
plt.ylabel('Frequency')
plt.title('Table')
plt.subplot(235)
sns.distplot(df_cut['volume'],color='r')
plt.xlabel('Volume')
plt.ylabel('Frequency')
plt.title('Distribution of Diamond Volume', fontsize=20)
plt.tight_layout()
It can be seen that there is a higher abundance of diamonds with low carat weight and low price, indicating that lighter and cheaper diamonds are the most common in the market. Interestingly, the distribution of volume does not follow a trend similar to that of depth and table, which are also measures of size. This might represent the first discrepancy in the dataset, as we expect size measures to be highly correlated.
# plot overall cross tab with both groups
plt.figure(figsize=(15,5))
ax1 = plt.subplot(1,3,1)
ax2 = plt.subplot(1,3,2)
ax3 = plt.subplot(1,3,3)
colors = ['tab:blue', 'tab:orange', 'tab:green','tab:red', 'tab:purple', 'tab:brown', 'tab:cyan']
plt.suptitle('Values Frequency for Diamond Categorical Features ')
sns.countplot(x="cut", order=['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], data=df_cut, ax=ax1)
sns.countplot(x="color", order=['D', 'E', 'F', 'G', 'H', 'I', 'J'], data=df_cut, ax=ax2)
sns.countplot(x="clarity", order=['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'], data=df_cut, ax=ax3)
plt.tight_layout()
Note that each chart is ordered from the highest level of the respective attribute (left) to the lowest (right).
The frequency of the various categorical features is presented above. Regarding cut, which is not an inherent property of the stone but is determined by how it is processed, we observe that high-quality cuts are more frequent, with the number of instances decreasing as cut quality decreases.
The color quality varies significantly, with the mid-grade colors being the most frequent.
Regarding clarity, which reflects the presence of imperfections in the stone, we observe that high clarity is rarer: as clarity decreases, the frequency of instances increases, indicating the rarity of diamonds with few imperfections. The lowest clarity, I1, is the least frequent in our dataset, which may indicate that diamonds of the lowest clarity are excluded from entering the market.
plt.figure(figsize=[18,9])
df_cut_grouped = df_cut.groupby(by='price_ranges')
perc = df_cut_grouped['price'].count()/len(df_cut['price']) *100
plt.suptitle('Price Analysis of Diamond Dataset')
plt.subplot(121)
ax = perc.plot(kind='barh')
plt.xlabel('Percentage')
plt.ylabel('Price Range ($)')
plt.subplot(122)
sns.violinplot(y=df_cut['price'])
plt.ylabel('Price ($)')
After binning the price into categories defined by ranges (10 bins of equal width), we can observe that more than 45% of the diamonds in the market belong to the cheapest category: the more expensive a diamond is, the more difficult it is to find. From the violin plot we can see that the typical price of a diamond is around \$2,500.
plt.figure(figsize=(20,10))
ax1 = plt.subplot(2,3,1)
ax2 = plt.subplot(2,3,2)
ax3 = plt.subplot(2,3,3)
plt.suptitle('Price Boxplots of Diamond Categorical Features')
sns.boxplot(x="cut", y="price", order=['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], data=df_cut, ax=ax1)
sns.boxplot(x="color", y="price", order=['D', 'E', 'F', 'G', 'H', 'I', 'J'], data=df_cut, ax=ax2)
sns.boxplot(x="clarity", y="price", order=['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'], data=df_cut, ax=ax3)
# Percentage of each grade's diamonds falling in each price bin.
# crosstab(..., normalize='index') performs the row-wise normalization
# directly, and reindex orders the rows from highest to lowest grade
# (crosstab sorts its index alphabetically, so positional iloc labeling
# would mis-order the cut grades)
top_cut_perc = (pd.crosstab(df_cut['cut'], df_cut['price_bin'],
                            normalize='index') * 100).reindex(
    ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'])
top_color_perc = (pd.crosstab(df_cut['color'], df_cut['price_bin'],
                              normalize='index') * 100).reindex(
    ['D', 'E', 'F', 'G', 'H', 'I', 'J'])
top_clarity_perc = (pd.crosstab(df_cut['clarity'], df_cut['price_bin'],
                                normalize='index') * 100).reindex(
    ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'])
# plot overall cross tab with both groups
plt.figure(figsize=(15,10))
ax1 = plt.subplot(1,3,1)
ax2 = plt.subplot(1,3,2)
ax3 = plt.subplot(1,3,3)
plt.suptitle('Price Range Percentages in different Categorical Features Values')
top_cut_perc.plot(kind='bar', stacked=True, ax=ax1, legend=False)
top_color_perc.plot(kind='bar', stacked=True, ax=ax2, legend=False)
top_clarity_perc.plot(kind='bar', stacked=True, ax=ax3, legend=False)
plt.tight_layout()
Interestingly, from the boxplots of the categorical features vs. price shown above, we observe that higher-quality cut, color, and clarity do not correlate with higher price. We therefore investigated the distribution of price ranges within the various cut, color, and clarity grades. Regarding cut, we see that ideal-cut diamonds present the smallest share of inexpensive diamonds; the higher price ranges, however, are similarly distributed among the cut qualities. Regarding color and clarity alike, we observe that diamonds with better color or clarity have more inexpensive instances. This suggests that price and color/clarity quality are anti-correlated, which might indicate an error in labeling the color, or an incorrect description of the color feature's ordering by the creator of the dataset.
to_use = ['carat','cut_ord','color_ord','clarity_ord', 'price_bin', 'volume', 'depth', 'table']
sns.set(style="darkgrid")  # sets the plotting style; note sns.set() returns None
f, ax = plt.subplots(figsize=(8, 8))
plt.title('Features Correlation Heatmap')
sns.heatmap(df_cut[to_use].corr(), annot=True)
Confirming the price analysis of the categorical features presented in Q2, we observe that cut, clarity, and color are slightly anti-correlated with the diamond price. The only features strongly correlated with price are carat weight and volume. Furthermore, we observe that clarity and volume are anti-correlated (-0.29), suggesting that it is easier to find impurities in bigger stones. A more in-depth investigation of these features is presented in the price vs. carat scatterplots below. They show that cut is correlated with neither price nor carat. On the other hand, even though we do not observe a correlation between price and color, we can visually see a slight anti-correlation between carat and color quality; the same trend is observed for clarity. This shows that higher-grade color and clarity are easier to find in smaller stones. The only feature correlated with both price and carat is volume.
plt.figure(figsize=(20,15))
ax1 = plt.subplot(2,2,1)
ax2 = plt.subplot(2,2,2)
ax3 = plt.subplot(2,2,3)
ax4 = plt.subplot(2,2,4)
# Including extreme cases would make the coloring of the plot uninterpretable,
# so we cap the colormap (vmax) at the 99th percentile of volume
print(df_cut['volume'].quantile(0.99))
plt.suptitle('Feature colored Price vs Carat Scatterplots')
df_cut.plot.scatter(x='carat', y='price', c='cut_ord', colormap='coolwarm', ax=ax1)
df_cut.plot.scatter(x='carat', y='price', c='color_ord', colormap='coolwarm', ax=ax2)
df_cut.plot.scatter(x='carat', y='price', c='clarity_ord', colormap='coolwarm', ax=ax3)
df_cut.plot.scatter(x='carat', y='price', c='volume', colormap='coolwarm', ax=ax4, vmax=354)
price_mean = df_cut_grouped.mean()
price_mean
price_std = df_cut_grouped.std()
price_std
plt.figure(figsize=(20,20))
to_plot=['cut_ord', 'color_ord', 'clarity_ord', 'carat']
ax1 = plt.subplot(2,1,1)
ax2 = plt.subplot(2,1,2)
plt.suptitle('Average Values of Diamond Features in Different Price Ranges')
price_mean[to_plot].plot(kind='line', figsize=(10,10), rot=10, yerr=price_std[to_plot], ax=ax1)
price_mean.volume.plot(kind='line', figsize=(10,10), rot=10, yerr=price_std['volume'], ax=ax2)
ax2.set_ylabel('Volume mm$^3$')
plt.tight_layout()
A crucial point for buyers and sellers is knowing which qualities are typical for a given price range. A seller needs this information when setting a sales price; likewise, a buyer wants it to judge the fairness of a transaction. We observe that the only clear trend is, once again, carat weight, which increases with price, while the other qualities assume similar average values regardless of price point and show significant standard deviations. In summary, better qualities in a diamond do not necessarily translate to a higher price.
import umap
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df_cut[to_use])
scaled
print(scaled.shape)
from sklearn.metrics.pairwise import euclidean_distances
import scipy
from scipy import stats
from scipy.spatial import distance
def pc(low_dim_array, high_dim_array):
    # Pairwise Euclidean distance matrices in each space
    low_dim_pairdist = euclidean_distances(low_dim_array, low_dim_array)
    high_dim_pairdist = euclidean_distances(high_dim_array, high_dim_array)
    # The matrices are symmetric: keep only the strict upper triangle.
    # (Flattening np.triu's output would also keep the zero-filled lower
    # half and bias the correlation toward zero.)
    iu = np.triu_indices_from(low_dim_pairdist, k=1)
    # Pearson correlation between the pairwise distances of data points
    # in the low- and high-dimensional embeddings
    p_corr, p_value = scipy.stats.pearsonr(low_dim_pairdist[iu],
                                           high_dim_pairdist[iu])
    return p_corr, p_value
Two crucial hyper-parameters for UMAP are the number of neighbors and the minimum distance. The first balances the accuracy of local versus global structure by varying the number of points in the local neighborhood: small values preserve the local data structure with high fidelity, while large values better represent the global structure at the cost of the local one. The minimum distance dictates how tightly points may be packed in the embedding: small values allow tight clustering, while large values spread the data out and tend to better preserve the global structure.
df_cut_to_use_arr = np.array(df_cut[to_use])
# Hyper-parameter grid; replace the degenerate first values
# (0 neighbors, 0 minimum distance) with valid minima
neighbors = np.linspace(0, 100, 5, dtype=int)
distances = np.linspace(0, 1, 5)
neighbors[0], distances[0] = 10, 0.1
pearson_arr = np.zeros((len(neighbors), len(distances)))
print('Neighbors:, Distance:, Pearson correlation:')
for n in range(len(neighbors)):
    for d in range(len(distances)):
        reducer = umap.UMAP(n_neighbors=neighbors[n], min_dist=distances[d],
                            n_components=2, metric='euclidean')
        reduced = reducer.fit_transform(scaled)
        # Compare embedding distances against the original feature space,
        # subsampling every 10th row to keep the distance matrices tractable
        p_corr, _ = pc(reduced[::10], df_cut_to_use_arr[::10])
        pearson_arr[n, d] = p_corr
        print(neighbors[n], distances[d], p_corr)
print(pearson_arr)
min_arr = np.min(pearson_arr)
max_arr = np.max(pearson_arr)
print('Standard deviation in the Pearson Correlation coefficient:')
print(np.std(pearson_arr))
plt.title('Pearson Correlation Coefficient for Different Combinations of UMAP Hyper-parameters')
sns.heatmap(pearson_arr, vmin=min_arr, vmax=max_arr, cmap="YlGnBu", yticklabels=neighbors, xticklabels=distances, annot=True)
plt.ylabel('Neighbors')
plt.xlabel('Distances')
In order to evaluate how well the data structure is preserved in the low-dimensional embedding, we employ Pearson correlation (PC) analysis: the distances in the projected low-dimensional space should reflect the original distances in the high-dimensional space. PC measures the linear correlation between two variables X and Y; here, X and Y are the pairwise distances in the high- and low-dimensional spaces, respectively. PC ranges from -1 to +1, where -1 indicates total anti-correlation and +1 total correlation.
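As a quick sanity check of the metric (a toy example, separate from the analysis): PC is $r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$, so perfectly linearly related distances yield +1 and perfectly inverted ones yield -1.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0])
r_pos, _ = stats.pearsonr(x, 2 * x + 1)  # perfect positive linear relation
r_neg, _ = stats.pearsonr(x, -x)         # perfect anti-correlation
print(r_pos, r_neg)
```

A PC near +1 between low- and high-dimensional pairwise distances therefore indicates that the embedding preserves the original distance structure well.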
reducer = umap.UMAP(n_neighbors=100, min_dist=1, n_components=2, metric='euclidean') # Best combination of hyperparameters
reduced = reducer.fit_transform(scaled)
# Place reduced representation of data in dataset
df_cut['UMAP 1'] = reduced[:,0] # First dimension
df_cut['UMAP 2'] = reduced[:,1] # Second dimension
plt.figure(figsize=(20,20))
ax1 = plt.subplot(2,1,1)
ax2 = plt.subplot(2,1,2)
sns.scatterplot(x='UMAP 1', y='UMAP 2', data=df_cut, hue=df_cut['price_ranges'].tolist(), s=20, alpha=0.7, ax=ax1)
sns.scatterplot(x='UMAP 1', y='UMAP 2', data=df_cut, hue=df_cut['cut'].tolist(), s=20, alpha=0.7, ax=ax2)
ax1.set_title('UMAP projection: Price range colored')
ax2.set_title('UMAP projection: Cut quality colored')
ax1.set_xlabel('UMAP 1')
ax2.set_xlabel('UMAP 1')
ax1.set_ylabel('UMAP 2')
ax2.set_ylabel('UMAP 2')
From the UMAP dimensionality reduction we observe that the price ranges are distinguishable in the projected surface. Although there is some overlap, a fairly clear separation is present. Note that gradual price increases correspond to increases along both axes, i.e., along a diagonal. The ideal cut is spread across all price ranges, confirming the analysis in Q2. Here we can identify areas where one may obtain a high-quality cut at a low price.
We observed that UMAP is capable of creating meaningful projections of the diamond feature space.